Text categorization using topic model and ontology networks

Authors

  • Yinghao Huang
  • Xipeng Wang
Abstract

Categorizing text into pre-defined document categories has been one of the most important tasks in text mining over recent decades. Successful text categorization relies heavily on the text representations generated from documents. This paper presents a text categorization model, VSM_WN_TM. VSM_WN_TM is a special Vector Space Model (VSM) that incorporates word frequencies, ontology networks and latent semantic information. Unlike traditional text representations that use only bag-of-words (BOW) features, it incorporates semantic and syntactic relationships among words, such as synonymy, co-occurrence and context, in order to provide a more inclusive and accurate text representation. A Support Vector Machine (SVM) is used as the document classifier, and the proposed system is evaluated on three publicly available datasets and one domain-specific dataset. Experimental results show that our approach significantly improves text classification, outperforming approaches that use only latent features as well as traditional VSM approaches.

Keywords: text categorization, vector space model, ontology network, topic model, support vector machine.
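As a rough illustration of the idea of enriching a bag-of-words vector with ontology information, the sketch below concatenates raw term counts with counts of shared concepts. It is not the authors' implementation: the `TOY_SYNSETS` table is a hypothetical stand-in for an ontology lookup such as WordNet, the function names are invented for this example, and the topic-model features and the SVM classifier described in the abstract are left out for brevity.

```python
from collections import Counter

# Hypothetical stand-in for an ontology lookup (e.g. WordNet):
# each word maps to a concept identifier shared by its synonyms.
TOY_SYNSETS = {
    "car": "vehicle.n",
    "automobile": "vehicle.n",
    "doctor": "physician.n",
    "physician": "physician.n",
}


def build_vocab(docs):
    """Return sorted term and concept vocabularies for tokenized documents."""
    terms = sorted({tok for doc in docs for tok in doc})
    concepts = sorted(
        {TOY_SYNSETS[tok] for doc in docs for tok in doc if tok in TOY_SYNSETS}
    )
    return terms, concepts


def featurize(doc, terms, concepts):
    """Concatenate raw term counts with ontology-concept counts."""
    term_counts = Counter(doc)
    concept_counts = Counter(
        TOY_SYNSETS[tok] for tok in doc if tok in TOY_SYNSETS
    )
    return [term_counts[t] for t in terms] + [concept_counts[c] for c in concepts]


docs = [
    "the doctor bought a car".split(),
    "the physician drove an automobile".split(),
]
terms, concepts = build_vocab(docs)
vectors = [featurize(doc, terms, concepts) for doc in docs]
```

In this toy corpus the two documents share only the word "the" in the term part of the vector, yet their concept parts are identical, which is the kind of synonymy signal a plain BOW representation misses. A fuller pipeline would presumably append per-document topic proportions to each vector before training the classifier; how the paper weights or combines the three feature groups is not stated in the abstract.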

Similar resources

Topic Modeling and Classification of Cyberspace Papers Using Text Mining

The global cyberspace networks provide individuals with platforms to interact, exchange ideas, share information, provide social support, conduct business, create artistic media, play games, engage in political discussions, and much more. The term cyberspace has become a conventional means to describe anything associated with the Internet and the diverse Internet culture. In fact, cyberspac...


An Improved Approach for Topic Ontology Based Categorization of Blogs Using Support Vector Machine

Problem statement: Information search, collection and categorization from the blogosphere is still one of the important issues to be resolved. Blogs mainly provide a variety of interesting and useful information. Because of their increasing growth, blogs cannot be categorized effectively. Therefore it is difficult to find relevant topics from the blogs. Hence blogs need to be categorized ...


Text Understanding from Scratch

This article demonstrates that we can apply deep learning to text understanding from character-level inputs all the way up to abstract text concepts, using temporal convolutional networks (ConvNets) (LeCun et al., 1998). We apply ConvNets to various large-scale datasets, including ontology classification, sentiment analysis, and text categorization. We show that temporal ConvNets can achieve asto...


SEWISE: An Ontology-based Web Information Search Engine

Since the beginning of the 1990s, the World Wide Web (WWW) has rapidly guided the world into an amazing new electronic village, where everybody can publish everything in electronic form and find almost all required information. The volume of available information is increasing exponentially in different formats, 80% being text. It remains hard to find interesting information directly from Web sources. ...


A Knowledge-based Topic Modeling Approach for Automatic Topic Labeling

Probabilistic topic models, which aim to discover latent topics in text corpora, define each document as a multinomial distribution over topics and each topic as a multinomial distribution over words. Although humans can infer a proper label for each topic by looking at its top representative words, machines cannot. Automatic Topic Labeling techniques try to add...


Publication date: 2014